In [ ]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb
sb.set_style('whitegrid')

import requests
import json
import re
from collections import Counter
from bs4 import BeautifulSoup

import string
import nltk

import networkx as nx

Fundamentals of text processing

Content in this section is adapted from Ramalho (2015) and Lutz (2013).

The most basic characters in a string are the ASCII characters. The string library in Python helpfully has these all listed out.


In [ ]:
string.ascii_letters

There are also punctuation and digits.


In [ ]:
string.punctuation

In [ ]:
string.digits

A string is basically a sequence of characters, each of which maps to a number. We can use many of the same functions and methods to analyze a string as we would with other iterables (like a list).

The len of a string returns the number of characters in it.


In [ ]:
len('Brian')

Two (or more) strings can be combined by adding them together.


In [ ]:
'Brian' + ' ' + 'Keegan'

Every character is mapped to an underlying integer code.


In [ ]:
ord('B')

In [ ]:
ord('b')

We can also use chr to do the reverse mapping: finding what character exists at a particular numeric value.


In [ ]:
chr(66)

When you're doing comparisons, you're basically comparing these numbers to each other.


In [ ]:
'b' == 'B'
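
Ordering comparisons work the same way: they compare the underlying integer codes, so a lowercase 'b' counts as "greater than" an uppercase 'B'. A quick illustration:


In [ ]:
'b' > 'B', ord('b'), ord('B')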

Here are the first 128 characters. Some of the early ones aren't printable characters, but are control or whitespace characters.


In [ ]:
[(i,chr(i)) for i in range(128)]

You'll notice that this ASCII character mapping doesn't include characters that have accents.


In [ ]:
s = 'Beyoncé'

This last character é also exists at a specific location.


In [ ]:
ord('é')

In [ ]:
chr(233)

However, the way Python performs this mapping internally is not necessarily how other computers and programs around the world represent text. If we use the popular UTF-8 standard to encode this string into a generic byte-level representation, we get something interesting:


In [ ]:
b = s.encode('utf8')
b

The encoded byte string b is one element longer than the original string s: the single character é became two bytes.


In [ ]:
print(s,len(s))
print(b,len(b))

If we take these two bytes individually and try to map them back to characters, we run into problems.


In [ ]:
ord(b'\xc3'), ord(b'\xa9')

In [ ]:
chr(195), chr(169)

We can convert from this byte-level representation back into Unicode with the .decode method.


In [ ]:
b.decode('utf8')

In [ ]:
ord(b'\xc3\xa9'.decode('utf8')), chr(233)

Using a different decoding standard like CP1252 returns something much more grotesque without throwing any errors.


In [ ]:
b.decode('cp1252')

There are many, many kinds of character encodings for representing non-ASCII text.

This cartoon pretty much explains why there are so many standards rather than a single standard:

  • Latin-1: the basis for many encodings
  • CP-1252: a common default encoding in Microsoft products similar to Latin-1
  • UTF-8: one of the most widely adopted and compatible - use it wherever possible
  • CP-437: used by the original IBM PC (predates latin1) but this old zombie is still lurking
  • GB-2312: implemented to support Chinese & Japanese characters, Greek & Cyrillic alphabets
  • UTF-16: treats everyone equally poorly, here there also be emojis

Other resources on why Unicode is what it is: this explanation by Ned Batchelder, this tutorial by Esther Nam and Travis Fischer, or this Unicode tutorial in the docs.


In [ ]:
for codec in ['latin1','utf8','cp437','gb2312','utf16']:
    print(codec.rjust(10),s.encode(codec), sep=' = ')

You will almost certainly encounter string encoding problems whenever you work with text data. Let's look at how quickly things can go wrong trying to decode a string when we don't know the standard.

Some standards map the \xe9 byte-level representation to the é character we intended, while other standards have nothing at that byte location, and still others map that byte location to a different character.


In [ ]:
montreal_s = b'Montr\xe9al'

for codec in ['cp437','cp1252','latin1','gb2312','iso8859_7','koi8_r','utf8','utf16']:
    print(codec.rjust(10),montreal_s.decode(codec,errors='replace'),sep=' = ')

How do you discover the proper encoding given an observed byte sequence? You can't. But you can make some informed guesses by using a library like chardet to find clues based on relative frequencies and presence of byte-order marks.
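
Here's a minimal sketch using chardet (this assumes the chardet package is installed separately, e.g. with pip; with only a few bytes to inspect, treat its guess and confidence score skeptically):


In [ ]:
import chardet

# Ask chardet to guess the encoding of the byte string from above
chardet.detect(montreal_s)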

What is your system's default? More like: what are the defaults? The situation on PCs is generally a hot mess, with Microsoft standards like CP1252 competing with international standards like UTF-8, but Macs generally try to keep everything in UTF-8.


In [ ]:
import sys, locale

expressions = """
locale.getpreferredencoding()
my_file.encoding
sys.stdout.encoding
sys.stdin.encoding
sys.stderr.encoding
sys.getdefaultencoding()
sys.getfilesystemencoding()
"""

my_file = open('dummy', 'w')

for expression in expressions.split():
    value = eval(expression)
    print(expression.rjust(30), '=', repr(value))

Whenever you encounter character encoding problems and cannot discover the original encoding (utf8, latin1, and cp1252 are always good ones to try first), you can tell the decoder to ignore or replace the offending bytes.


In [ ]:
montreal_s.decode('utf8')

In [ ]:
for error_handling in ['ignore','replace']:
    print(error_handling,montreal_s.decode('utf8',errors=error_handling),sep='\t')

Unfortunately, only tears, fist-shaking, and hair-pulling will give you the necessary experience to handle the inevitability of character encoding issues when working with textual data.

Loading Wikipedia biographies of Presidents

Load the data from disk into memory. See Appendix 1 at the end of the notebook for more details.


In [ ]:
with open('potus_wiki_bios.json','r') as f:
    bios = json.load(f)

Confirm there are 44 presidents (shaking fist at Grover Cleveland, the 22nd and 24th POTUS) in the dictionary.


In [ ]:
print("There are {0} biographies of presidents.".format(len(bios)))

What's an example of a single biography? We access the dictionary by passing the key (President's name), which returns the value (the text of the biography).


In [ ]:
example = bios['Grover Cleveland']
print(example)

We are going to discuss how to process large text documents using the Natural Language Toolkit (NLTK) library. We first have to download some corpora and models to use NLTK. Running this block of code should pop up a new window with four blue tabs: Collections, Corpora, Models, All Packages. Under Collections, select the entry with "book" in the Identifier column and select download. Once the status "Finished downloading collection 'book'." prints in the grey bar at the bottom, you can close this pop-up.

You should only need to do this next step once on each computer where you use NLTK.


In [ ]:
# Download a specific lexicon for the sentiment analysis in the next lecture
nltk.download('vader_lexicon')

# Opens the interface to download all the other corpora
nltk.download()

An important part of processing natural language data is normalizing it by removing variation that the computer naively treats as different entities but humans recognize as being the same. There are several steps to this, including case adjustment ("House" to "house"), tokenization (finding individual words), and stemming/lemmatization ("tried" to "try").

This figure is a nice summary of the process of pre-processing your text data. The HTML to ASCII data step has already been done with the get_page_content function in the Appendix.

In the case of case adjustment, it turns out several of the different "words" in the corpus are actually the same, but because they have different capitalizations, they're counted as different unique words.

Counting words

How many words are in President Cleveland's article?

A biography can be represented as a single large string (as it is now), but this huge string is not very helpful for analyzing features of the text until the string is segmented into "tokens", which include words but also hyphenated phrases or contractions ("aren't", "doesn't", etc.)

There are a variety of different segmentation/tokenization strategies (with different tradeoffs) and corresponding methods implemented in NLTK.

We could employ a naive approach of splitting on spaces. But this turns out to create "words" that include stray punctuation.


In [ ]:
example_ws_tokens = example.split(' ')
print("There are {0:,} words when splitting on white spaces.".format(len(example_ws_tokens)))
example_ws_tokens[:25]

We could use regular expressions to split on repeated whitespaces.


In [ ]:
example_re_tokens = re.split(r'\s+',example)
print("There are {0:,} words when splitting on white spaces with regular expressions.".format(len(example_re_tokens)))
example_re_tokens[0:25]

It's clear we want to separate words based on other punctuation as well so that "Darkness," and "Darkness" aren't treated as separate words. Again, NLTK has a variety of methods for doing word tokenization more intelligently.

word_tokenize is probably the easiest to recommend.


In [ ]:
example_wt_tokens = nltk.word_tokenize(example)
print("There are {0:,} words when splitting on white spaces with word_tokenize.".format(len(example_wt_tokens)))
example_wt_tokens[:25]

But there are others, like wordpunct_tokenize, that make different assumptions about the language.


In [ ]:
example_wpt_tokens = nltk.wordpunct_tokenize(example)
print("There are {0:,} words when splitting on white spaces with wordpunct_tokenize.".format(len(example_wpt_tokens)))
example_wpt_tokens[:25]
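
To see these different assumptions concretely, compare how the two tokenizers treat a short sentence with a contraction (a quick illustration):


In [ ]:
example_sentence = "They aren't the same."

print(nltk.word_tokenize(example_sentence))
print(nltk.wordpunct_tokenize(example_sentence))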

ToktokTokenizer is yet another word tokenizer.


In [ ]:
toktok = nltk.ToktokTokenizer()
example_ttt_tokens = toktok.tokenize(example)
print("There are {0:,} words when splitting on white spaces with TokTok.".format(len(example_ttt_tokens)))

example_ttt_tokens[:25]

There are a variety of strategies for splitting a text document up into its constituent words, each making different assumptions about word boundaries, which results in different counts of the resulting tokens.


In [ ]:
for name,tokenlist in zip(['space_split','re_tokenizer','word_tokenizer','wordpunct_tokenizer','toktok_tokenizer'],[example_ws_tokens,example_re_tokens,example_wt_tokens,example_wpt_tokens,example_ttt_tokens]):
    print("{0:>20}: {1:,} total tokens, {2:,} unique tokens".format(name,len(tokenlist),len(set(tokenlist))))

Word cases

Remember that strings of different cases (capitalizations) are treated as different words: "young" and "Young" are not the same. An important part of text processing is to remove unneeded variation, and mixed case is variation we generally don't care about.


In [ ]:
example_wpt_lowered = [token.lower() for token in example_wpt_tokens]
unique_wpt = len(set(example_wpt_tokens))
unique_lowered_wpt = len(set(example_wpt_lowered))
difference = unique_wpt - unique_lowered_wpt

print("There are {0:,} unique words in example before lowering and {1:,} after lowering,\na difference of {2} unique tokens.".format(unique_wpt,unique_lowered_wpt,difference))

Stop words

English, like many languages, heavily repeats certain words in typical usage ("the", "of", "and") that don't convey much information by themselves. When we do text processing, we should make sure to remove these "stop words".


In [ ]:
nltk.FreqDist(example_wpt_lowered).most_common(25)

NLTK helpfully has a list of stopwords in different languages.


In [ ]:
english_stopwords = nltk.corpus.stopwords.words('english')
english_stopwords[:10]

We can use the string module's punctuation attribute as well.


In [ ]:
list(string.punctuation)[:10]

Let's combine them to get a list of all_stopwords that we can ignore.


In [ ]:
all_stopwords = english_stopwords + list(string.punctuation) + ['–']

We can loop over the tokens to exclude the words in this stopword list from the analysis while also lowercasing each word. This is not perfect, but it's an improvement over what we had before.


In [ ]:
wpt_lowered_no_stopwords = []

for word in example_wpt_tokens:
    if word.lower() not in all_stopwords:
        wpt_lowered_no_stopwords.append(word.lower())

fdist_wpt_lowered_no_stopwords = nltk.FreqDist(wpt_lowered_no_stopwords)
fdist_wpt_lowered_no_stopwords.most_common(25)
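
The same filtering can also be written as a single list comprehension, which produces an identical list of tokens:


In [ ]:
wpt_comprehension = [w.lower() for w in example_wpt_tokens if w.lower() not in all_stopwords]
wpt_comprehension == wpt_lowered_no_stopwords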

The distribution of word frequencies, even after stripping out stopwords, follows a remarkably strong pattern. Most terms are used infrequently (upper left) but a handful of terms are used repeatedly! Zipf's law states:

"the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc."


In [ ]:
freq_counter = Counter(fdist_wpt_lowered_no_stopwords.values())

f,ax = plt.subplots(1,1)

ax.scatter(x=list(freq_counter.keys()),y=list(freq_counter.values()))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Term frequency')
ax.set_ylabel('Number of terms')
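
The scatter plot above shows how many terms occur at each frequency. As a more direct check of Zipf's rank-frequency claim, we can also plot each term's frequency against its rank on log-log axes (a quick sketch reusing the frequency distribution from above); an approximately straight, downward-sloping line is the classic Zipfian signature.


In [ ]:
# Sort term frequencies from most to least common and assign ranks 1, 2, 3, ...
ranked_freqs = sorted(fdist_wpt_lowered_no_stopwords.values(), reverse=True)
ranks = np.arange(1, len(ranked_freqs) + 1)

f,ax = plt.subplots(1,1)

ax.scatter(x=ranks, y=ranked_freqs)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Rank')
ax.set_ylabel('Term frequency')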

Lemmatization

Lemmatization (and the related concept of stemming) are methods for dealing with conjugated and inflected words. Words like "ate" or "eats" are counted as distinct from "eat", although semantically they are similar and should likely be grouped together. Where stemming just removes common suffixes and prefixes, sometimes resulting in mangled words, lemmatization attempts to return the root word (the lemma). However, lemmatization can be computationally expensive, which makes it a poor candidate for very large corpora.
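
To see the difference concretely, here is a small sketch comparing NLTK's Porter stemmer with the WordNet lemmatizer on a few verbs (it assumes the NLTK corpora downloaded earlier; the lemmatizer is told to treat each word as a verb):


In [ ]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lemmatizer_demo = WordNetLemmatizer()

for word in ['ate', 'eats', 'studies', 'argued']:
    # The stemmer chops off suffixes; the lemmatizer looks the word up in WordNet
    print(word, porter.stem(word), lemmatizer_demo.lemmatize(word, 'v'), sep='\t')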

The get_wordnet_pos and lemmatizer functions below work with each other to lemmatize a word to its root. This involves attempting to discover the part-of-speech (POS) for each word and passing this POS to NLTK's lemmatize function, ultimately returning the root word (if it exists in the "wordnet" corpus).


In [ ]:
from nltk.corpus import wordnet

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
def lemmatizer(token):
    token,tb_pos = nltk.pos_tag([token])[0]
    pos = get_wordnet_pos(tb_pos)
    lemma = wnl.lemmatize(token,pos)
    return lemma

Loop through all the tokens in wpt_lowered_no_stopwords, applying the lemmatizer function to each. Then inspect 25 examples of words where the lemmatizer changed the word length.


In [ ]:
wpt_lemmatized = [lemmatizer(t) for t in wpt_lowered_no_stopwords]
[(i,j) for (i,j) in list(zip(wpt_lowered_no_stopwords,wpt_lemmatized)) if len(i) != len(j)][:25]

Pulling the pieces all together

We can combine all this functionality together into a single function text_preprocessor that takes a large string of text and returns a list of cleaned tokens: lowercased, stripped of stopwords and very short tokens, and lemmatized.


In [ ]:
def text_preprocessor(text):
    """Takes a large string (document) and returns a list of cleaned tokens"""
    tokens = nltk.wordpunct_tokenize(text)
    clean_tokens = []
    for t in tokens:
        if t.lower() not in all_stopwords and len(t) > 2:
            clean_tokens.append(lemmatizer(t.lower()))
    return clean_tokens

We can apply this function to every presidential biography (this may take a minute or so) and write the resulting lists of cleaned tokens to the "potus_wiki_bios_cleaned.json" file. We'll use this file in the next lecture as well.


In [ ]:
# Clean each bio
cleaned_bios = {}

for bio_name,bio_text in bios.items():
    cleaned_bios[bio_name] = text_preprocessor(bio_text)

# Save to disk
with open('potus_wiki_bios_cleaned.json','w') as f:
    json.dump(cleaned_bios,f)

Comparative descriptive statistics

Now that we have cleaned biographies for each president, we can perform some basic analyses of the text. Which presidents have the longest biographies?


In [ ]:
potus_total_words = {}

for bio_name,bio_text in cleaned_bios.items():
    potus_total_words[bio_name] = len(bio_text)
    
pd.Series(potus_total_words).sort_values(ascending=False)

How many unique words?


In [ ]:
potus_unique_words = {}

for bio_name,bio_text in cleaned_bios.items():
    potus_unique_words[bio_name] = len(set(bio_text))
    
pd.Series(potus_unique_words).sort_values(ascending=False)

The lexical diversity is the ratio of unique words to total words. Values closer to 0 indicate the presence of repeated words (low diversity) and values closer to 1 indicate words used only once (high diversity).


In [ ]:
def lexical_diversity(token_list):
    unique_tokens = len(set(token_list))
    total_tokens = len(token_list)
    if total_tokens > 0:
        return unique_tokens/total_tokens
    else:
        return 0

In [ ]:
potus_lexical_diversity = {}

for bio_name,bio_text in cleaned_bios.items():
    potus_lexical_diversity[bio_name] = lexical_diversity(bio_text)
    
pd.Series(potus_lexical_diversity).sort_values(ascending=False)

We can count how often a word occurs in each biography.


In [ ]:
# Import the Counter function
from collections import Counter

# Get counts of each token from the cleaned_bios for Grover Cleveland
cleveland_counts = Counter(cleaned_bios['Grover Cleveland'])

# Convert to a pandas Series and sort
pd.Series(cleveland_counts).sort_values(ascending=False).head(25)

In [ ]:
potus_word_counts = {}

for bio_name,bio_text in cleaned_bios.items():
    potus_word_counts[bio_name] = Counter(bio_text)
    
potus_word_counts_df = pd.DataFrame(potus_word_counts).T

potus_word_counts_df.to_csv('potus_word_counts.csv',encoding='utf8')

print("There are {0:,} unique words across the {1} presidents.".format(potus_word_counts_df.shape[1],potus_word_counts_df.shape[0]))

Which words occur the most across presidential biographies?


In [ ]:
potus_word_counts_df.sum().sort_values(ascending=False).head(20)

Case study: Preprocess the S&P500 articles and compute statistics

Step 1: Load the "sp500_wiki_articles.json", use the text_preprocessor function (or some other sequence of functions) from above to clean these articles up, and save the cleaned content to "sp500_wiki_articles_cleaned.json".


In [ ]:

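One possible approach is sketched below (it assumes "sp500_wiki_articles.json" exists on disk, e.g. produced by Appendix 2, and reuses the text_preprocessor function from above):


In [ ]:
# Load the raw articles
with open('sp500_wiki_articles.json','r') as f:
    sp500_articles = json.load(f)

# Clean each article with the same preprocessing pipeline used for the presidents
cleaned_sp500 = {name: text_preprocessor(text) for name, text in sp500_articles.items()}

# Save the cleaned tokens to disk
with open('sp500_wiki_articles_cleaned.json','w') as f:
    json.dump(cleaned_sp500, f)
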
Step 2: Compute some descriptive statistics about the company articles with the most words, most unique words, greatest lexical diversity, most used words across articles, and number of unique words across all articles.


In [ ]:

Appendix 1: Retrieving Wikipedia content by category

Functions and operations to scrape the most recent (10 August 2018) Wikipedia content from every member of "Category:Presidents of the United States".

The get_page_content function will get the content of the article as HTML and parse the HTML to return something close to a clean string of text. The get_category_subcategories and get_category_members will get all the members of a category in Wikipedia.


In [ ]:
def get_page_content(title,lang='en',redirects=1):
    """Takes a page title and returns a (large) string of the HTML content 
    of the revision.
    
    title - a string for the title of the Wikipedia article
    lang - a string (typically two letter ISO 639-1 code) for the language 
        edition, defaults to "en"
    redirects - 1 or 0 for whether to follow page redirects, defaults to 1
    
    Returns:
    str - a (large) string of the content of the revision
    """
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard','Portal:','s:','File:','Digital object identifier','(page does not exist)']
    
    # Get the response from the API for a query
    params = {'action':'parse',
          'format':'json',
          'page':title,
          'redirects':redirects,
          'prop':'text',
          'disableeditsection':1,
          'disabletoc':1
         }

    url = 'https://{0}.wikipedia.org/w/api.php'.format(lang)
    req = requests.get(url,params=params)

    json_string = json.loads(req.text)
    
    if 'parse' in json_string.keys():
        new_title = json_string['parse']['title']
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')
        
        # Remove sections at end
        bad_sections = ['See_also','Notes','References','Bibliography','External_links']
        sections = soup.find_all('h2')
        for section in sections:
            if section.span['id'] in bad_sections:
                
                # Clean out the divs
                div_siblings = section.find_next_siblings('div')
                for sibling in div_siblings:
                    sibling.clear()
                    
                # Clean out the ULs
                ul_siblings = section.find_next_siblings('ul')
                for sibling in ul_siblings:
                    sibling.clear()
        
        # Get all the paragraphs
        paras = soup.find_all('p')
        
        text_list = []
        
        for para in paras:
            _s = para.text
            # Remove the citations
            _s = re.sub(r'\[[0-9]+\]','',_s)
            text_list.append(_s)
        
        final_text = '\n'.join(text_list).strip()
        
        return {new_title:final_text}

def get_category_subcategories(category_title,lang='en'):
    """The function accepts a category_title and returns a list of the category's sub-categories
    
    category_title - a string (including "Category:" prefix) of the category name
    lang - a string (typically two letter ISO 639-1 code) for the language edition,
        defaults to "en"
    
    Returns:
    members - a list containing strings of the sub-categories in the category
    
    """
    # Replace spaces with underscores
    category_title = category_title.replace(' ','_')
    
    # Make sure "Category:" appears in the title
    if 'Category:' not in category_title:
        category_title = 'Category:' + category_title
        
    _S="https://{1}.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle={0}&cmtype=subcat&cmprop=title&cmlimit=500&format=json&formatversion=2".format(category_title,lang)
    json_response = requests.get(_S).json()

    members = list()
    
    if 'categorymembers' in json_response['query']:
        for member in json_response['query']['categorymembers']:
            members.append(member['title'])
            
    return members
    
def get_category_members(category_title,depth=1,lang='en'):
    """The function accepts a category_title and returns a list of category members
    
    category_title - a string (including "Category:" prefix) of the category name
    lang - a string (typically two letter ISO 639-1 code) for the language edition,
        defaults to "en"
    
    Returns:
    members - a list containing strings of the page titles in the category
    
    """
    # Replace spaces with underscores
    category_title = category_title.replace(' ','_')
    
    # Make sure "Category:" appears in the title
    if 'Category:' not in category_title:
        category_title = 'Category:' + category_title
    
    _S="https://{1}.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle={0}&cmprop=title&cmnamespace=0&cmlimit=500&format=json&formatversion=2".format(category_title,lang)
    json_response = requests.get(_S).json()

    members = list()
    
    if depth < 0:
        return members
    
    if 'categorymembers' in json_response['query']:
        for member in json_response['query']['categorymembers']:
            members.append(member['title'])
            
    subcats = get_category_subcategories(category_title,lang=lang)
    
    for subcat in subcats:
        members += get_category_members(subcat,depth-1)
            
    return members

Use get_category_members to get all the immediate members (depth=0) of "Category:Presidents of the United States."


In [ ]:
presidents = get_category_members('Presidents_of_the_United_States',depth=0)
presidents

Loop through the presidents list from the fourth entry onward and retrieve each president's biography using get_page_content. Store the results in the presidents_wiki_bios dictionary.


In [ ]:
presidents_wiki_bios = {}
for potus in presidents[3:]:
    presidents_wiki_bios.update(get_page_content(potus))

Save the data to a JSON file.


In [ ]:
with open('potus_wiki_bios.json','w') as f:
    json.dump(presidents_wiki_bios,f)

Appendix 2: Retrieving Wikipedia content from a list

Wikipedia maintains a (superficially) up-to-date List of S&P 500 companies, but not a category of the constituent members. As with the presidents, we want to retrieve all of their Wikipedia articles, parse their content, and perform some NLP tasks.

First, get the content of the article so we can parse out the list.


In [ ]:
title = 'List of S&P 500 companies'
lang = 'en'
redirects = 1

params = {'action':'parse',
          'format':'json',
          'page':title,
          'redirects':1,
          'prop':'text',
          'disableeditsection':1,
          'disabletoc':1
         }

url = 'https://en.wikipedia.org/w/api.php'
req = requests.get(url,params=params)

json_string = json.loads(req.text)

if 'parse' in json_string.keys():
    page_html = json_string['parse']['text']['*']

    # Parse the HTML into Beautiful Soup
    soup = BeautifulSoup(page_html,'lxml')

The hard way to get a list of the company names out is parsing the HTML table. We:

  1. Find all the tables in the soup
  2. Get the first table out
  3. Find all the rows in the table
  4. Loop through each row
  5. Find the links in each row
  6. Get the second link's title in each row
  7. Add the title to company_names

In [ ]:
company_names = []

# Get the first table
component_stock_table = soup.find_all('table')[0]

# Get all the rows after the first (header) row
rows = component_stock_table.find_all('tr')[1:]

# Loop through each row and extract the title
for row in rows:
    # Get all the links in a row
    links = row.find_all('a')
    # Get the title in the 2nd cell from the left
    title = links[1]['title']
    # Add it to company_names
    company_names.append(title)

print("There are {0:,} titles in the list".format(len(set(company_names))))

The easy way is to use pandas's read_html function to parse the table into a DataFrame and access the "Security" (second) column.


In [ ]:
company_df = pd.read_html(str(component_stock_table),header=0)[0]
company_df.head()

In [ ]:
company_names = company_df['Security'].tolist()

Now we can use get_page_content to get the content of each company's page and add it to the sp500_articles dictionary.


In [ ]:
sp500_articles = {}

for company in set(company_names):
    sp500_articles.update(get_page_content(company))

Save the data to a JSON file.


In [ ]:
with open('sp500_wiki_articles.json','w') as f:
    json.dump(sp500_articles,f)